# Multimodal Pre-training
**Yoloe 11l Seg** · jameslahm · Object Detection · 219 downloads · 2 likes
YOLOE is a real-time visual omni-model that supports various vision tasks, including zero-shot object detection.

**Yoloe V8l Seg** · jameslahm · Object Detection · 4,135 downloads · 1 like
YOLOE is a real-time visual omni-model that combines object detection and visual understanding capabilities, suitable for various visual tasks.

**Yoloe V8s Seg** · jameslahm · Object Detection · 28 downloads · 0 likes
YOLOE is a zero-shot object detection model capable of detecting various objects in visual scenes in real time.
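
The YOLOE checkpoints above can be run through the `ultralytics` package (recent releases ship YOLOE support). Below is a minimal text-prompted detection sketch; the package version, the weight filename, and the prompting API are assumptions, not something stated in this listing:

```python
from ultralytics import YOLOE  # requires an ultralytics version that ships YOLOE

# Load a YOLOE detection/segmentation checkpoint (filename is an assumption;
# substitute the weight file you actually downloaded, e.g. yoloe-v8l-seg.pt).
model = YOLOE("yoloe-11l-seg.pt")

# Open-vocabulary detection: describe the target classes with free-form text.
names = ["person", "bus", "traffic light"]
model.set_classes(names, model.get_text_pe(names))

# Run inference on an image and save an annotated copy.
results = model.predict("street.jpg")
results[0].save("street_annotated.jpg")
```
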
**Aimv2 Huge Patch14 224.apple Pt** · timm · Image Classification · Transformers · 93 downloads · 0 likes
AIMv2 is an efficient image encoder implemented on top of the timm library, suitable for image feature extraction tasks.

**Aimv2 3b Patch14 224.apple Pt** · timm · Image Classification · Transformers · 50 downloads · 0 likes
AIMv2 is an efficient image encoder compatible with the timm library, suitable for computer vision tasks.
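
Both AIMv2 encoders, like the timm ViT checkpoints listed below, can be used as plain feature extractors through `timm`. A minimal sketch, assuming the weights are published on the Hub under `timm/aimv2_huge_patch14_224.apple_pt` (the exact identifier is an assumption):

```python
import timm
import torch
from PIL import Image

# Load the encoder from the Hub without a classification head; num_classes=0
# makes the model return pooled image features. The hf_hub path is assumed.
model = timm.create_model(
    "hf_hub:timm/aimv2_huge_patch14_224.apple_pt",
    pretrained=True,
    num_classes=0,
)
model.eval()

# Build the preprocessing pipeline that matches the checkpoint's training config.
data_cfg = timm.data.resolve_model_data_config(model)
transform = timm.data.create_transform(**data_cfg, is_training=False)

# Extract a feature vector for a single image.
image = Image.open("example.jpg").convert("RGB")
with torch.no_grad():
    features = model(transform(image).unsqueeze(0))
print(features.shape)  # (1, embedding_dim)
```
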
**Vit So400m Patch14 Siglip 378.webli** · timm · Apache-2.0 · Image Classification · Transformers · 82 downloads · 0 likes
A vision Transformer model based on SigLIP, containing only an image encoder and utilizing the original attention pooling mechanism.

**Vit Large Patch16 Siglip Gap 384.webli** · timm · Apache-2.0 · Image Classification · Transformers · 13 downloads · 0 likes
A vision Transformer model based on SigLIP, utilizing global average pooling, suitable for image feature extraction tasks.

**Vit Base Patch16 Siglip 384.webli** · timm · Apache-2.0 · Image Classification · Transformers · 64 downloads · 1 like
A vision Transformer model based on SigLIP, containing only the image encoder and using the original attention pooling mechanism.

**Vit Base Patch16 Siglip 224.webli** · timm · Apache-2.0 · Image Classification · Transformers · 330 downloads · 1 like
A vision Transformer model based on SigLIP, containing only the image encoder and using the original attention pooling mechanism.

**Vit Large Patch14 Clip 224.laion2b** · timm · Apache-2.0 · Image Classification · Transformers · 502 downloads · 0 likes
A vision Transformer model based on the CLIP architecture, specialized in image feature extraction.
**Aimv2 Large Patch14 Native Image Classification** · amaye15 · MIT · Image Classification · Transformers · 15 downloads · 2 likes
AIMv2-Large-Patch14-Native is an adapted image classification model, modified from the original AIMv2 model to be compatible with Hugging Face Transformers' AutoModelForImageClassification class.
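
Since the checkpoint above was adapted specifically for Transformers' `AutoModelForImageClassification`, a classification call is straightforward. A minimal sketch; the Hub ID is an assumption, and the repository may require `trust_remote_code=True` if it ships custom model code:

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForImageClassification

# Assumed Hub ID for the adapted checkpoint; adjust to the actual repository.
model_id = "amaye15/aimv2-large-patch14-native-image-classification"

processor = AutoImageProcessor.from_pretrained(model_id)
model = AutoModelForImageClassification.from_pretrained(model_id)
model.eval()

image = Image.open("example.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

predicted_class = logits.argmax(-1).item()
print(model.config.id2label.get(predicted_class, predicted_class))
```
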
**Vit Base Patch32 Clip 224.metaclip 400m** · timm · Image Classification · 2,406 downloads · 0 likes
A vision-language model trained on the MetaCLIP-400M dataset, supporting zero-shot image classification tasks.

**Vit Base Patch32 Clip 224.laion2b E16** · timm · MIT · Image Classification · 7,683 downloads · 0 likes
A vision Transformer model trained on the LAION-2B dataset, supporting zero-shot image classification tasks.

**Openclip Resnet50 CC12M** · thaottn · MIT · Image Classification · 13.67k downloads · 0 likes
An OpenCLIP model based on the ResNet50 architecture and trained on the CC12M dataset, supporting zero-shot image classification tasks.
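
A zero-shot classification sketch with the `open_clip` library. The built-in `RN50` / `cc12m` pretrained tag is an assumption about how this checkpoint maps onto open_clip's registry; loading the exact Hub repository may instead require an `hf-hub:` path:

```python
import torch
import open_clip
from PIL import Image

# ResNet-50 CLIP weights pretrained on CC12M; the 'cc12m' tag is assumed to be
# available in the installed open_clip version.
model, _, preprocess = open_clip.create_model_and_transforms("RN50", pretrained="cc12m")
tokenizer = open_clip.get_tokenizer("RN50")
model.eval()

labels = ["a photo of a dog", "a photo of a cat", "a photo of a car"]
image = preprocess(Image.open("example.jpg")).unsqueeze(0)
text = tokenizer(labels)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(dict(zip(labels, probs[0].tolist())))
```
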
**Wav2vec2 Base Audioset** · ALM · Audio Classification · Transformers · 2,191 downloads · 0 likes
An audio representation learning model based on the HuBERT architecture, pre-trained on the complete AudioSet dataset.

**Test2** · mccaly · Apache-2.0 · Image Segmentation · Transformers · 22 downloads · 1 like
FoodSeg103 is a dataset containing 7,118 food images annotated with 104 ingredient categories, with an average of 6 ingredient labels and pixel-level masks per image.

**Eva Giant Patch14 Clip 224.laion400m S11b B41k** · timm · MIT · Text-to-Image · 459 downloads · 1 like
A vision-language model based on the CLIP architecture, supporting zero-shot image classification tasks.

**Eva02 Large Patch14 Clip 336.merged2b S6b B61k** · timm · MIT · Text-to-Image · 15.78k downloads · 0 likes
EVA02 is a large-scale vision-language model based on the CLIP architecture, supporting zero-shot image classification tasks.
**Pix2struct Base** · google · Apache-2.0 · Image-to-Text · Transformers · Supports Multiple Languages · 6,390 downloads · 71 likes
Pix2Struct is an image encoder-text decoder model trained on various image-text pairs for tasks including image captioning and visual question answering.
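
A minimal inference sketch with Transformers' dedicated Pix2Struct classes, assuming the Hub ID `google/pix2struct-base` corresponds to this entry; note that the base checkpoint is a pretraining artifact and is normally fine-tuned on a downstream task before use:

```python
import torch
from PIL import Image
from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor

model_id = "google/pix2struct-base"  # assumed Hub ID

processor = Pix2StructProcessor.from_pretrained(model_id)
model = Pix2StructForConditionalGeneration.from_pretrained(model_id)
model.eval()

# Pix2Struct consumes a rendered image; fine-tuned variants also accept a
# text prompt (e.g. a question) via the processor's `text=` argument.
image = Image.open("document.png").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    generated_ids = model.generate(**inputs, max_new_tokens=50)

print(processor.decode(generated_ids[0], skip_special_tokens=True))
```
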
**Chinese Clip Vit Large Patch14 336px** · OFA-Sys · Text-to-Image · Transformers · 713 downloads · 23 likes
Chinese CLIP is a simple implementation of CLIP trained on approximately 200 million Chinese image-text pairs, using ViT-L/14@336px as the image encoder and RoBERTa-wwm-base as the text encoder.
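
Transformers ships dedicated Chinese-CLIP classes; a zero-shot image-text matching sketch, assuming the Hub ID `OFA-Sys/chinese-clip-vit-large-patch14-336px`:

```python
import torch
from PIL import Image
from transformers import ChineseCLIPModel, ChineseCLIPProcessor

model_id = "OFA-Sys/chinese-clip-vit-large-patch14-336px"  # assumed Hub ID

model = ChineseCLIPModel.from_pretrained(model_id)
processor = ChineseCLIPProcessor.from_pretrained(model_id)
model.eval()

image = Image.open("example.jpg").convert("RGB")
texts = ["一只猫", "一条狗", "一辆汽车"]  # candidate Chinese captions

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Image-to-text similarity, normalized over the candidate captions.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(texts, probs[0].tolist())))
```
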
**Taiyi Stable Diffusion 1B Chinese EN V0.1** · IDEA-CCNL · Openrail · Text-to-Image · Chinese · 182 downloads · 106 likes
The first open-source Chinese-English bilingual Stable Diffusion model, trained on 20 million filtered Chinese image-text pairs.

**Xclip Base Patch16 Ucf 2 Shot** · microsoft · MIT · Text-to-Video · Transformers · English · 51 downloads · 1 like
X-CLIP is a minimalist extension of CLIP for general video-language understanding. The model is trained on (video, text) pairs through contrastive learning.
**Layoutlmv3 Large Finetuned Funsd** · HYPJUDY · Text Recognition · Transformers · 66 downloads · 5 likes
LayoutLMv3-large fine-tuned on the FUNSD dataset, specializing in document understanding tasks.

**Layoutlmv3 Base Finetuned Funsd** · HYPJUDY · Text Recognition · Transformers · 329 downloads · 4 likes
A document AI model based on LayoutLMv3-base and fine-tuned on the FUNSD dataset, designed for form understanding tasks.
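
A token-classification (form understanding) sketch for the FUNSD fine-tunes above. Both Hub IDs, and the assumption that the checkpoints load through the Auto classes, are unverified here; the processor's built-in OCR additionally requires `pytesseract`:

```python
import torch
from PIL import Image
from transformers import AutoModelForTokenClassification, AutoProcessor

# Processor from the upstream base model (runs Tesseract OCR on the image);
# fine-tuned weights from the FUNSD checkpoint. Both IDs are assumptions.
processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=True)
model = AutoModelForTokenClassification.from_pretrained("HYPJUDY/layoutlmv3-base-finetuned-funsd")
model.eval()

image = Image.open("form.png").convert("RGB")
inputs = processor(image, return_tensors="pt", truncation=True)

with torch.no_grad():
    logits = model(**inputs).logits

# Map each token to its predicted FUNSD entity label (question/answer/header/other).
predictions = logits.argmax(-1).squeeze(0).tolist()
tokens = processor.tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
print([(t, model.config.id2label[p]) for t, p in zip(tokens, predictions)])
```
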
**Layoutlmv2 Large Uncased Finetuned Vi Infovqa** · tiennvcs · Text-to-Image · Transformers · 16 downloads · 0 likes
A document visual question answering model fine-tuned from microsoft/layoutlmv2-large-uncased, suitable for Vietnamese information extraction tasks.

**Bros Large Uncased** · naver-clova-ocr · Large Language Model · Transformers · 55 downloads · 6 likes
BROS is a pre-trained language model focusing on text and layout, designed to better extract key information from documents.
**Gpt2 Chinese Poem** · uer · Large Language Model · Chinese · 1,905 downloads · 38 likes
A Chinese classical poetry generation model based on the GPT2 architecture, pre-trained by UER-py, capable of generating Chinese classical poetry.
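
A generation sketch following the usage pattern of UER's Chinese GPT-2 checkpoints, which pair a GPT-2 language model head with a BERT-style tokenizer; the Hub ID `uer/gpt2-chinese-poem` and the prompt format are assumptions:

```python
from transformers import BertTokenizer, GPT2LMHeadModel, TextGenerationPipeline

model_id = "uer/gpt2-chinese-poem"  # assumed Hub ID

# UER's Chinese GPT-2 models use a BERT tokenizer, so load it explicitly
# instead of relying on AutoTokenizer.
tokenizer = BertTokenizer.from_pretrained(model_id)
model = GPT2LMHeadModel.from_pretrained(model_id)

generator = TextGenerationPipeline(model, tokenizer)

# Prompt with the opening of a classical poem; [CLS] marks the sequence start.
result = generator("[CLS] 梅 山 如 积 翠 ，", max_length=50, do_sample=True)
print(result[0]["generated_text"])
```
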
**Markuplm Large Finetuned Qa** · FuriouslyAsleep · Multimodal Fusion · Transformers · 50 downloads · 1 like
A question-answering model fine-tuned from Microsoft's MarkupLM architecture, designed for Q&A tasks that combine web markup languages (HTML/XML) with text.